Course: Applied Geo-data Science at the University of Bern (Institute of Geography)
Supervisor: Prof. Dr. Benjamin Stocker
Advisers: Dr. Koen Hufkens, Pepa Aran, Pascal Schneider
Further information: https://geco-bern.github.io/agds/
Questions about the workflow? Contact the author:
Author: Bigler Patrick (patrick.bigler1@students.unibe.ch)
Matriculation number: 20-100-178
Reference: Report Exercise 5 (Chapter 10)
Here, we explore the role of structure in the data for model generalisability and how to best estimate a “true” out-of-sample error that corresponds to the prediction task. The task is to train a model on ecosystem flux observations from one site and predict them for another site (spatial upscaling).
Concept of cross-validation
Spatial upscaling
The theory is very similar to re_ml_01; only the resampling method changes to cross-validation.
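To make the concept concrete, the following is a minimal, self-contained base-R sketch of k-fold cross-validation on toy data (a simple linear model instead of KNN, purely for illustration; the report itself relies on the course's helper functions):

```r
# Minimal sketch of 10-fold cross-validation (toy data, base R only).
set.seed(123)
x <- runif(100)
y <- 2 * x + rnorm(100, sd = 0.1)
k <- 10
# Randomly assign each observation to one of k folds
folds <- sample(rep(1:k, length.out = length(y)))
mae_per_fold <- sapply(1:k, function(i) {
  train <- folds != i                    # fit on k-1 folds
  fit   <- lm(y[train] ~ x[train])
  pred  <- coef(fit)[1] + coef(fit)[2] * x[!train]
  mean(abs(y[!train] - pred))            # MAE on the held-out fold
})
mean(mae_per_fold)                       # cross-validated MAE estimate
```

Averaging the fold-wise errors gives a more stable estimate of the out-of-sample error than a single train/test split.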
All analyses were carried out in the open-source program RStudio (version 2022.12.0+353), using the R language for data handling. We use the R package “tidyverse” and several other packages to process the data efficiently.
The following code chunk loads all required packages. The package “conflicted” is particularly important: functions from different packages can share the same name while doing different things, and “conflicted” lets us set explicit preferences for which one is used.
source("../../R/general/packages.R")
First, we download the data and save it in our repository, but we need to do this only once. If the file already exists, we read it directly; that is why we use an if-else statement.
name.of.file <- "../../data/re_ml_02/daily_fluxes.davos.csv"
# If the file does not exist yet, create it
if (!file.exists(name.of.file)){
# Access the data
url.1 <- "https://raw.githubusercontent.com/geco-bern/agds/main/data/FLX_CH-Dav_FLUXNET2015_FULLSET_DD_1997-2014_1-3.csv"
# Read the data directly from the URL
daily_fluxes.davos <- read.table(url.1, header = TRUE, sep = ",")
# Write a CSV file into the repository
write_csv(daily_fluxes.davos, "../../data/re_ml_02/daily_fluxes.davos.csv")
# Read the file
daily_fluxes.davos <- read_csv("../../data/re_ml_02/daily_fluxes.davos.csv")
# If the file already exists, just read it
}else{daily_fluxes.davos <- read_csv("../../data/re_ml_02/daily_fluxes.davos.csv")}
name.of.file <- "../../data/re_ml_02/daily_fluxes.laegern.csv"
# If the file does not exist yet, create it
if (!file.exists(name.of.file)){
# Access the data
url.2 <- "https://raw.githubusercontent.com/geco-bern/agds/main/data/FLX_CH-Lae_FLUXNET2015_FULLSET_DD_2004-2014_1-4.csv"
# Read the data directly from the URL
daily_fluxes.laegern <- read.table(url.2, header = TRUE, sep = ",")
# Write a CSV file into the repository
write_csv(daily_fluxes.laegern, "../../data/re_ml_02/daily_fluxes.laegern.csv")
# Read the file
daily_fluxes.laegern <- read_csv("../../data/re_ml_02/daily_fluxes.laegern.csv")
# If the file already exists, just read it
}else{daily_fluxes.laegern <- read_csv("../../data/re_ml_02/daily_fluxes.laegern.csv")}
Some columns contain -9999 as a value; our quality function replaces it with NA. We then use ymd() from the “lubridate” package to parse the dates properly. Furthermore, we only want columns of good quality, so we check selected columns against their quality-control columns: if the proportion of good measurements is below 80 %, we overwrite the value with NA (and do not drop the row!). Now we know that our data is of high quality, and we can perform our analyses with it.
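As a hedged illustration of this cleaning logic, here is a self-contained base-R sketch on toy data (the column names GPP_NT_VUT_REF and NEE_VUT_REF_QC follow the FLUXNET2015 convention; the actual helper use.good.quality() lives in the sourced script):

```r
# Toy daily-flux table: a sentinel value (-9999) and a low-quality day
df <- data.frame(
  TIMESTAMP      = c(19970101, 19970102, 19970103),
  GPP_NT_VUT_REF = c(1.2, -9999, 2.1),
  NEE_VUT_REF_QC = c(0.95, 0.90, 0.40)  # fraction of good half-hours
)
df[df == -9999] <- NA                   # sentinel value -> NA
# Parse the numeric date (the report uses lubridate::ymd() for this)
df$TIMESTAMP <- as.Date(as.character(df$TIMESTAMP), format = "%Y%m%d")
# Mask values whose quality-control fraction is below 80 % (keep the row)
df$GPP_NT_VUT_REF[df$NEE_VUT_REF_QC < 0.8] <- NA
```

After these steps only trustworthy GPP values remain, while the time series keeps its full length.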
Here, we work with two locations (Davos and Lägern), but we do not want to do everything twice. Therefore, we define a new data frame that contains the information of both locations. We use .id = “id” so that we know which site each value comes from; with that, we can use the filter() function from the package “dplyr” to analyse our data efficiently.
# Load the function in the file (we use the function from another markdown again)
source("../../R/re_ml_01/function.use.good.quality.only.R")
# Function call to clean the data of Davos
daily_fluxes.davos <- use.good.quality(daily_fluxes.davos)
# Function call to clean the data of Lägern
daily_fluxes.laegern <- use.good.quality(daily_fluxes.laegern)
# We create a new data frame with bind_rows. We use "id" as an identifier.
daily_fluxes_both <- bind_rows(daily_fluxes.davos, daily_fluxes.laegern, .id = "id")
# We check the data again. Now our dataset contains NAs and only the columns of interest
# Load the function
source("../../R/general/function.visualize.na.values.R")
# Function call
visualize.na.values.without.groups(daily_fluxes_both)
Here, we split the data into a training set (80 %) and a test set (20 %). We set a seed so that the pseudo-random split is reproducible.
# For reproducibility (pseudo-random)
set.seed(123)
# Split 80 % to 20 %
split_davos <- rsample::initial_split(daily_fluxes_both |>
                                        dplyr::filter(id == 1),
                                      prop = 0.8, strata = "VPD_F")
split_laegern <- rsample::initial_split(daily_fluxes_both |>
                                          dplyr::filter(id == 2),
                                        prop = 0.8, strata = "VPD_F")
split_both <- rsample::initial_split(daily_fluxes_both, prop = 0.8, strata = "VPD_F")
# Split Davos
daily_fluxes_davos_train <- rsample::training(split_davos)
daily_fluxes_davos_test <- rsample::testing(split_davos)
# Split Lägern
daily_fluxes_laegern_train <- rsample::training(split_laegern)
daily_fluxes_laegern_test <- rsample::testing(split_laegern)
# Split pooled
daily_fluxes_both_train <- rsample::training(split_both)
daily_fluxes_both_test <- rsample::testing(split_both)
We use a sequence of candidate values to determine the optimal k: the model with the optimal k has the smallest mean absolute error (MAE). We are also interested in other metrics, such as R² (RSQ), which our function reports as well. We create three models (Davos, Lägern, and pooled).
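Conceptually, the k-selection performed by parameter.extracter() can be sketched in a few lines of base R: fit a KNN regression for each candidate k and keep the k with the smallest test MAE. The sketch below uses a tiny hand-rolled 1-D KNN on toy data (the real function works on the flux data via the course workflow):

```r
# Toy data: noisy sine curve, split into train and test
set.seed(123)
x_tr <- runif(200); y_tr <- sin(2 * pi * x_tr) + rnorm(200, sd = 0.2)
x_te <- runif(50);  y_te <- sin(2 * pi * x_te) + rnorm(50,  sd = 0.2)

# Predict each query point as the mean of its k nearest training neighbours
knn_predict <- function(k, x0) {
  sapply(x0, function(x) mean(y_tr[order(abs(x_tr - x))[1:k]]))
}

# Same candidate sequence shape as in the report
ks  <- c(1, 2, 3, 4, seq(5, 100, by = 5))
mae <- sapply(ks, function(k) mean(abs(y_te - knn_predict(k, x_te))))
k_opt <- ks[which.min(mae)]   # k with the smallest mean absolute error
```

Plotting mae against ks would reproduce the typical U-shaped curve: small k overfits, large k underfits.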
We can see that the optimal k is 59. This seems odd, because the data are the same as in exercise re_ml_01, where the optimal k was 25. However, we used a different approach to build the model: with cross-validation, we find a different optimal k for the same data (and the same set.seed).
source("../../R/re_ml_01/function.parameter.extracter.R")
source("../../R/re_ml_01/function.train.and.test.R")
# Define a sequence for k. Use 1,2,3,4 to show the curve at the beginning
my.sequence <- c(1, 2, 3, 4, seq(5, to = 100, by = 5), 59)
# Visualize the MAE and RSQ --> optimal k for Davos is 59
parameter.extracter(my.sequence, daily_fluxes_davos_train, daily_fluxes_davos_test)
We do the same for Lägern; here the optimal k is 25. We visualize the main metrics again.
source("../../R/re_ml_01/function.parameter.extracter.R")
source("../../R/re_ml_01/function.train.and.test.R")
# Define a sequence for k. Use 1,2,3,4 to show the curve at the beginning
my.sequence <- c(1, 2, 3, 4, seq(5, to = 100, by = 5))
# Visualize the MAE and RSQ --> optimal k is 25 (between 20:30)
parameter.extracter(my.sequence, daily_fluxes_laegern_train, daily_fluxes_laegern_test)
We do the same for the pooled data and find an optimal k of 38.
source("../../R/re_ml_01/function.parameter.extracter.R")
source("../../R/re_ml_01/function.train.and.test.R")
# Define a sequence for k. Use 1,2,3,4 to show the curve at the beginning
my.sequence <- c(1, 2, 3, 4, seq(5, to = 100, by = 5))
# Visualize the MAE and RSQ --> optimal k is 38 (between 30:40)
parameter.extracter(my.sequence, daily_fluxes_both_train, daily_fluxes_both_test)
parameter.extracter(c(30:40), daily_fluxes_both_train, daily_fluxes_both_test)
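The models are then rebuilt with 10-fold cross-validation. As a hedged, self-contained illustration of what a helper like knn.cv.model(train, k, folds) does conceptually, here is a base-R sketch that estimates the CV error of a KNN regression for a fixed k on toy data (the actual helper is assumed to wrap the caret workflow from the course):

```r
# Toy data: noisy sine curve
set.seed(123)
x <- runif(300); y <- sin(2 * pi * x) + rnorm(300, sd = 0.2)

# 10-fold cross-validated MAE of a 1-D KNN regression with a given k
cv_mae_knn <- function(k, n_folds = 10) {
  folds <- sample(rep(1:n_folds, length.out = length(y)))
  mean(sapply(1:n_folds, function(i) {
    tr   <- folds != i                 # fit on 9 folds
    pred <- sapply(x[!tr], function(x0)
      mean(y[tr][order(abs(x[tr] - x0))[1:k]]))
    mean(abs(y[!tr] - pred))           # MAE on the held-out fold
  }))
}
cv_mae_knn(25)   # CV estimate of the MAE for k = 25
```

The returned value estimates the out-of-sample MAE without ever touching the test set, which is exactly why CV-based tuning generalises better than tuning on the test split.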
Now we know the optimal k for each model, and we recalculate them…
source("../../R/re_ml_02/function.knn.cv.model.R")
# Function call for Davos
knn.model.davos.optimal <- knn.cv.model(daily_fluxes_davos_train, 59, 10)
# Function call for Lägern
knn.model.laegern.optimal <- knn.cv.model(daily_fluxes_laegern_train, 25, 10)
# Function call for Davos and Lägern
knn.model.both.optimal <- knn.cv.model(daily_fluxes_both_train, 38, 10)
…and visualize our findings.
# Load the function
source("../../R/re_ml_01/function.evaluation.model.R")
# Model Davos
P_1 <- eval_model(knn.model.davos.optimal, daily_fluxes_davos_train,
daily_fluxes_davos_test,
c("Davos (opt. k = 59)"), c("Davos (opt. k = 59)"))
P_2 <- eval_model(knn.model.davos.optimal, daily_fluxes_davos_train,
daily_fluxes_laegern_test,
c("Davos (opt. k = 59)"), c("Lägern (opt. k = 59)"))
P_3 <- eval_model(knn.model.davos.optimal, daily_fluxes_davos_train,
daily_fluxes_both_test,
c("Davos (opt. k = 59)"), c("Davos and Lägern (opt. k = 59)"))
# Model Lägern
P_4 <- eval_model(knn.model.laegern.optimal, daily_fluxes_laegern_train,
daily_fluxes_davos_test,
c("Lägern (opt. k = 25)"), c("Davos (opt. k = 25)"))
P_5 <- eval_model(knn.model.laegern.optimal, daily_fluxes_laegern_train,
daily_fluxes_laegern_test,
c("Lägern (opt. k = 25)"), c("Lägern (opt. k = 25)"))
P_6 <- eval_model(knn.model.laegern.optimal, daily_fluxes_laegern_train,
daily_fluxes_both_test,
c("Lägern (opt. k = 25)"), c("Davos+Lägern (opt. k = 25)"))
# Model Davos und Lägern
P_7 <- eval_model(knn.model.both.optimal, daily_fluxes_both_train,
daily_fluxes_davos_test, c("Davos and Lägern (opt. k = 38)"),
c("Davos (opt. k = 38)"))
P_8 <- eval_model(knn.model.both.optimal, daily_fluxes_both_train,
daily_fluxes_laegern_test, c("Davos and Lägern (opt. k = 38)"),
c("Lägern (opt. k = 38)"))
P_9 <- eval_model(knn.model.both.optimal, daily_fluxes_both_train,
daily_fluxes_both_test, c("Davos and Lägern (opt. k = 38)"),
c("Davos+Lägern (opt. k = 38)"))
# Plots
cowplot::plot_grid(P_1, P_2, P_3, ncol = 1)
cowplot::plot_grid(P_4, P_5, P_6, ncol = 1)
cowplot::plot_grid(P_7, P_8, P_9, ncol = 1)
The function is very flexible: if we do not specify which months we want, it uses all months. Now we can analyse whether there are any patterns. First, we look at the whole year: the model underestimates the GPP values in Davos, whereas it overestimates them for Lägern. To explain this, we need a higher (monthly) resolution, so we take a closer look at the seasons (DJF, MAM, JJA, SON). The function calculates the difference between the measured GPP and the fitted GPP.
# Load the function into the file
source("../../R/re_ml_02/function.vis.residue.R")
# Residuals for Davos: train: pooled, test: Davos (yearly)
time.variation.year(knn.model.both.optimal, daily_fluxes_davos_test,
c("Train: Pooled, Test: Davos, k = 38"),
c("re_ml_02 (Chapter 10)"))
# Residuals for Davos: train: pooled, test: Davos (monthly)
time.variation.ml.02(knn.model.both.optimal,
daily_fluxes_davos_test,c("KNN-Model:"),
c("Train: Pooled, Test: Davos, k = 38"),
c("re_ml_02 (Chapter 10)"))
time.variation.ml.02(knn.model.both.optimal,
daily_fluxes_davos_test,c("KNN-Model:"),
c("Train: Pooled, Test: Davos, k = 38"),
c("re_ml_02 (Chapter 10)"), c("December", "January", "February"))
time.variation.ml.02(knn.model.both.optimal,
daily_fluxes_davos_test,c("KNN-Model:"),
c("Train: Pooled, Test: Davos, k = 38"),
c("re_ml_02 (Chapter 10)"), c("March","April", "May"))
time.variation.ml.02(knn.model.both.optimal,
daily_fluxes_davos_test,c("KNN-Model:"),
c("Train: Pooled, Test: Davos, k = 38"),
c("re_ml_02 (Chapter 10)"), c("June","July", "August"))
time.variation.ml.02(knn.model.both.optimal,
daily_fluxes_davos_test,c("KNN-Model:"),
c("Train: Pooled, Test: Davos, k = 38"),
c("re_ml_02 (Chapter 10)"), c("September","October", "November"))
We do the same for Lägern.
# Residuals for Lägern: train: pooled, test: Lägern (yearly)
time.variation.year(knn.model.both.optimal, daily_fluxes_laegern_test,
c("Train: Pooled, Test: Lägern, k = 38"),
c("re_ml_02 (Chapter 10)"))
# We predict Lägern with pooled
time.variation.ml.02(knn.model.both.optimal,
daily_fluxes_laegern_test,c("KNN-Model:"),
c("Train: Pooled, Test: Lägern, k = 38"),
c("re_ml_02 (Chapter 10)"))
# Seasonal residuals for Lägern (DJF)
time.variation.ml.02(knn.model.both.optimal,
daily_fluxes_laegern_test,c("KNN-Model:"),
c("Train: Pooled, Test: Lägern, k = 38"),
c("re_ml_02 (Chapter 10)"), c("December","January", "February"))
time.variation.ml.02(knn.model.both.optimal,
daily_fluxes_laegern_test,c("KNN-Model:"),
c("Train: Pooled, Test: Lägern, k = 38"),
c("re_ml_02 (Chapter 10)"), c("March","April", "May"))
time.variation.ml.02(knn.model.both.optimal,
daily_fluxes_laegern_test,c("KNN-Model:"),
c("Train: Pooled, Test: Lägern, k = 38"),
c("re_ml_02 (Chapter 10)"), c("June","July", "August"))
time.variation.ml.02(knn.model.both.optimal,
daily_fluxes_laegern_test,c("KNN-Model:"),
c("Train: Pooled, Test: Lägern, k = 38"),
c("re_ml_02 (Chapter 10)"), c("September","October", "November"))
We have merged two data sets, but the two locations are very different.
| Quantity | Davos | Lägern |
|---|---|---|
| Coordinates (1) | 46.81° N, 9.84° E | 47.48° N, 8.4° E |
| Elevation (1) | 1594 m a.s.l. | 873 m a.s.l. |
| Exposure (1) | South-east slope | Summit |
| 2 m temperature (1) (1990-2020) | 3.8 °C - 5.3 °C | 6.2 °C - 9.4 °C |
| Precipitation (1) (1990-2020) | 1046 mm per year | - |
| Hours of sunshine (1) (1990-2020) | 1537 h - 2076 h | 1462 h - 2152 h |
| Wind (1) (1990-2020) | 6.8 km/h - 10.4 km/h | 15 km/h - 18 km/h |
| Vegetation (2) | Coniferous forest | Deciduous forest |
We tried different approaches to modelling GPP, took several metrics into account (e.g., RSQ, MAE), and optimised k. Our goal was to create a generalisable model, i.e., one that can predict the same variable (GPP) at different locations (Davos, Lägern). After identifying the presumably best model, we applied it to the test data. The mean difference between the predicted and the observed test data is called the bias; with this parameter, we are able to discuss our analysis.
The bias is a very important parameter. We can see that our model underestimates the GPP for Davos. If we group the data by month, we see that the model underestimates in every month; however, not every month shows the same dispersion: in the winter season (DJF) the dispersion is smaller than in spring (MAM).
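The bias metric discussed here can be computed in one line; the following toy example (invented numbers, purely illustrative) shows the sign convention used in the discussion:

```r
# Bias = mean difference between fitted and observed values (toy numbers)
observed <- c(2.0, 3.1, 4.2, 5.0)
fitted   <- c(1.8, 2.9, 4.0, 4.9)
bias <- mean(fitted - observed)
# bias < 0: the model underestimates on average (as for Davos);
# bias > 0: it overestimates (as for Lägern).
```

Grouping the same difference by month (e.g. with dplyr::group_by() before taking the mean) yields the monthly bias pattern shown in the plots.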
The bias for Lägern is positive, so the model overestimates the GPP there. At monthly resolution this remains true, but the amplitude varies: for example, the model overestimates GPP in summer (JJA) by almost \(1.8\:\mu mol\:m^{-2}\:s^{-1}\) and in all other months by almost zero. The reason for this is not clear. It could be the elevation: in Davos, the dispersion starts in April, for which the later snowmelt could be responsible. At Lägern, much less snowfall is measured, and the dispersion starts in February. This means the vegetation at Lägern was “active” earlier than in Davos.
Another explanation could be the exposure. At Lägern, the station is located on a summit, where there is more wind. The wind might bias the measurements, although we consider this unlikely.
The vegetation type could also be an explanation: deciduous forests (Lägern) and coniferous forests (Davos) differ in their phenology, which would fit the earlier onset of GPP at Lägern.
(1) Meteorological Data
(2) Vegetation